This is part two of an analysis to build a model that can predict whether students are consuming dangerous amounts of alcohol. If you have already viewed part one of this analysis, you will notice many similarities. The primary difference is that there are only two risk groups in this analysis (Low and High), compared to the three risk groups (Low, Medium, High) in part one. The goal of this analysis is to see if changing the parameters of the problem can lead to a more predictive model. The model in part one struggled to identify students in the higher risk groups, partly because of the small amount of data that we are working with and how unbalanced the data is. By simplifying the problem, the model will, ideally, become more accurate, and therefore more useful.
## 'data.frame': 649 obs. of 33 variables:
## $ school : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
## $ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
## $ age : int 18 17 15 15 16 16 16 17 15 15 ...
## $ address : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
## $ famsize : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
## $ Pstatus : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
## $ Medu : int 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : int 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
## $ Fjob : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
## $ reason : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
## $ guardian : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
## $ traveltime: int 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : int 2 2 2 3 2 2 2 2 2 2 ...
## $ failures : int 0 0 0 0 0 0 0 0 0 0 ...
## $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
## $ famsup : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
## $ paid : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
## $ nursery : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
## $ higher : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ internet : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
## $ romantic : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ famrel : int 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : int 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : int 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : int 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : int 1 1 3 1 2 2 1 1 1 1 ...
## $ health : int 3 3 3 5 5 5 3 1 1 5 ...
## $ absences : int 4 2 6 0 0 6 0 2 0 0 ...
## $ G1 : int 0 9 12 14 11 12 13 10 15 12 ...
## $ G2 : int 11 11 13 14 13 12 12 13 16 12 ...
## $ G3 : int 11 11 12 14 13 13 13 13 17 13 ...
It’s a rather small dataset that we are working with, and I’m worried that the data is not fully representative of the average student. We are only looking at students who have taken a Portuguese class, an elective course, so there might be a particular ‘type’ of student that takes this class. I expect that we would have a more accurate sense of the average student if we had data from a mandatory class, such as math or English. Nonetheless, I hope we will have some interesting and useful findings.
## [1] "Percent of students in each drinking level:"
##
## 1 2 3 4 5
## 0.69491525 0.18644068 0.06625578 0.02619414 0.02619414
The majority of the students consume very little, if any, alcohol during the week, but about 5% of students drink significant amounts (values >= 4).
## [1] "Percent of students in each drinking level:"
##
## 1 2 3 4 5
## 0.38058552 0.23112481 0.18489985 0.13405239 0.06933744
Clearly, students are drinking much more alcohol on weekends compared to weekdays. The percent of significant drinkers (values >= 4) jumped from about 5% to just over 20%. The share of those drinking little to no alcohol (value = 1) was reduced by nearly half (from 69% to 38%).
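As a quick check of the figures above, the share of significant drinkers can be recomputed from the two proportion tables. This is a small Python sketch rather than the R code used for the analysis; the variable and function names are my own, and the values are copied from the printed output.

```python
# Proportion of students at each drinking level (1 to 5),
# copied from the weekday and weekend tables printed above.
weekday = {1: 0.69491525, 2: 0.18644068, 3: 0.06625578, 4: 0.02619414, 5: 0.02619414}
weekend = {1: 0.38058552, 2: 0.23112481, 3: 0.18489985, 4: 0.13405239, 5: 0.06933744}


def share_at_or_above(proportions, threshold=4):
    """Sum the proportions of all levels at or above `threshold`."""
    return sum(p for level, p in proportions.items() if level >= threshold)


weekday_heavy = share_at_or_above(weekday)
weekend_heavy = share_at_or_above(weekend)
print(f"weekday: {weekday_heavy:.1%}, weekend: {weekend_heavy:.1%}")
```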
## [1] "Correlation between weekday and weekend:"
## [1] 0.6165614
Of the 649 students, 241 (37.1%) drink little to no alcohol. Of the 451 students who do not drink during the week, 210 (46.6%) consume some alcohol on the weekends. It is very rarely the case that students drink more during the week than on weekends.
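The counts in this paragraph follow from the printed proportions. The sketch below (Python, not the original R; names are my own) assumes that a combined score of 2 means level 1 on both weekdays and weekends, which is the only way two values between 1 and 5 can sum to 2.

```python
total_students = 649

# Weekday level 1 share, from the weekday table.
weekday_level_1 = round(0.69491525 * total_students)

# Combined score of 2 (level 1 on weekdays and weekends), from the combined table.
both_level_1 = round(0.371340524 * total_students)

# Students at level 1 during the week who report a higher level on weekends.
weekend_only = weekday_level_1 - both_level_1
weekend_only_share = weekend_only / weekday_level_1

print(weekday_level_1, both_level_1, weekend_only, round(weekend_only_share, 3))
```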
## [1] "Percent of students in each group:"
##
## 2 3 4 5 6 7
## 0.371340524 0.178736518 0.152542373 0.112480740 0.077041602 0.049306626
## 8 9 10
## 0.026194145 0.009244992 0.023112481
For the remainder of this analysis, we will divide the students into two groups: ‘Low Risk’ for students with a total alcohol consumption value (weekday plus weekend) <= 5, and ‘High Risk’ for students with a total > 5 or with either the ‘Weekday’ or ‘Weekend’ value >= 4.
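The grouping rule can be written as a small function. This is a Python sketch of the rule as described, not the R code behind the analysis; `risk_group` is a name I made up, and the sample values are the first ten `Dalc` and `Walc` entries from the `str()` output at the top.

```python
def risk_group(weekday, weekend):
    """Return 'High' if the total exceeds 5 or either value is 4 or more, else 'Low'."""
    total = weekday + weekend
    if total > 5 or weekday >= 4 or weekend >= 4:
        return "High"
    return "Low"


# First ten students from the str() output (Dalc = weekday, Walc = weekend).
dalc = [1, 1, 2, 1, 1, 1, 1, 1, 1, 1]
walc = [1, 1, 3, 1, 2, 2, 1, 1, 1, 1]

groups = [risk_group(d, w) for d, w in zip(dalc, walc)]
print(groups)
```

Note that the second condition matters only for totals of exactly 5: a student with values (1, 4) or (4, 1) is placed in the high risk group even though the total does not exceed 5.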
## [1] "Percent of students in each risk group:"
##
## Low High
## 0.770416 0.229584
Although it is good that the majority of the students are in the low risk group, it will be very important to build a model that can accurately predict which students are in the high risk group, so that they can receive the guidance they need to stop this detrimental behaviour.
## [1] "Number of students attending each school"
##
## Gabriel Pereira Mousinho da Silveira
## 423 226
##
## Gabriel Pereira Mousinho da Silveira
## 0.651772 0.348228
Although there are more students attending Gabriel Pereira, the relative number of students in each risk group is about the same.
## [1] "Number of students of each gender:"
##
## Female Male
## 383 266
##
## Female Male
## 0.5901387 0.4098613
Despite 59% of the students being female, 68% of those in the high risk group are male.
## [1] "Number of students of each age:"
##
## 15 16 17 18 19 20 21 22
## 112 177 179 140 32 6 2 1
##
## 15 16 17 18 19 20
## 0.172573190 0.272727273 0.275808937 0.215716487 0.049306626 0.009244992
## 21 22
## 0.003081664 0.001540832
## [1] "Correlation between age and risk"
## [1] 0.1207322
Although there are fewer older students to make this conclusion with, it seems that as students age, they drink more.
## [1] "Number of students of each type of address:"
##
## Rural Urban
## 197 452
##
## Rural Urban
## 0.3035439 0.6964561
Many more students live in urban areas, but this does not tell us anything about the risk of alcohol abuse.
## [1] "Number of students of each family size:"
##
## Greater than 3 Less than 3
## 457 192
##
## Greater than 3 Less than 3
## 0.7041602 0.2958398
Students from smaller families are slightly more likely to be at risk of alcohol abuse.
## [1] "Number of students of each parental marriage status group:"
##
## Apart Together
## 80 569
##
## Apart Together
## 0.1232666 0.8767334
I was expecting to see an effect here, but there doesn’t appear to be one.
## [1] "Number of students for each level of education (Mother):"
##
## None 4th Grade 5th to 9th Grade
## 6 143 186
## Secondary Education Higher Education
## 139 175
##
## None 4th Grade 5th to 9th Grade
## 0.009244992 0.220338983 0.286594761
## Secondary Education Higher Education
## 0.214175655 0.269645609
Oddly, the relationship between risk and mother’s education level seems to alternate with each level of education. 4th Grade & Secondary Education: High risk; 5th to 9th Grade & Higher Education: Low risk.
## [1] "Number of students for each level of education (Father):"
##
## None 4th Grade 5th to 9th Grade
## 7 174 209
## Secondary Education Higher Education
## 131 128
##
## None 4th Grade 5th to 9th Grade
## 0.01078582 0.26810478 0.32203390
## Secondary Education Higher Education
## 0.20184900 0.19722650
Unlike with the mother’s education level, there doesn’t appear to be any relationship between a father’s education level and their child’s drinking habits.
## [1] "Correlation between mothers' and fathers' education levels:"
## [1] 0.6474766
There is a reasonably strong correlation between a mother’s and a father’s education level. This helps to explain the similarities in the plots we just saw.
## [1] "Number of students for each type of job (Mother):"
##
## At_Home Health Other Services Teacher
## 135 48 258 136 72
##
## At_Home Health Other Services Teacher
## 0.20801233 0.07395994 0.39753467 0.20955316 0.11093991
There doesn’t appear to be any strong relationship here.
## [1] "Number of students for each type of job (Father):"
##
## At_Home Health Other Services Teacher
## 42 23 367 181 36
##
## At_Home Health Other Services Teacher
## 0.06471495 0.03543914 0.56548536 0.27889060 0.05546995
This looks more significant. If a father works in services, it seems that their child is more likely to abuse alcohol.
## [1] "Number of students for each type of reason:"
##
## Course Preference Close to Home Other School Reputation
## 285 149 72 143
##
## Course Preference Close to Home Other School Reputation
## 0.4391371 0.2295840 0.1109399 0.2203390
If the school was chosen based on its reputation, the student appears less likely to abuse alcohol.
## [1] "Number of students for each type of guardian:"
##
## Father Mother Other
## 153 455 41
##
## Father Mother Other
## 0.23574730 0.70107858 0.06317411
It is interesting to see that so many students chose their mother as their primary guardian, yet so few students have separated parents. If the guardian is ‘Other,’ the student is more likely to be in the high risk group.
## [1] "Number of students for each travel time group:"
##
## Less than 15 15 to 30 30 to 60 More than 60
## 366 213 54 16
##
## Less than 15 15 to 30 30 to 60 More than 60
## 0.56394453 0.32819723 0.08320493 0.02465331
## df$Risk: Low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 1.000 1.532 2.000 4.000
## --------------------------------------------------------
## df$Risk: High
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 1.000 1.691 2.000 4.000
## [1] "Correlation between travel time and risk"
## [1] 0.08954301
We can see from the mean values that as travel time increases, students are slightly more likely to abuse alcohol. However, the relationship is not strong, as seen by the median, the 3rd quartile values, and the correlation.
## [1] "Number of students for each study time group:"
##
## Less than 2 Hours 2 to 5 Hours 5 to 10 Hours
## 212 305 97
## More than 10 Hours
## 35
##
## Less than 2 Hours 2 to 5 Hours 5 to 10 Hours
## 0.32665639 0.46995378 0.14946071
## More than 10 Hours
## 0.05392912
## df$Risk: Low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 2.000 2.022 2.000 4.000
## --------------------------------------------------------
## df$Risk: High
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 1.000 1.624 2.000 4.000
## [1] "Correlation between time spent studying and risk:"
## [1] -0.2018618
There is a reasonable relationship here. If a student spends less time studying, s/he is more likely to abuse alcohol. Note: The values for ‘Time Spent Studying’ (1,2,3,4) represent ‘Less than 2 Hours’, ‘2 to 5 Hours’, ‘5 to 10 Hours’, and ‘More than 10 Hours’.
## [1] "Number of students for each failure group:"
##
## 0 1 2 3
## 549 70 16 14
##
## 0 1 2 3
## 0.84591680 0.10785824 0.02465331 0.02157165
## df$Risk: Low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 0.184 0.000 3.000
## --------------------------------------------------------
## df$Risk: High
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 0.349 0.000 3.000
## [1] "Correlation between number of failed classes and risk:"
## [1] 0.1170598
As a student fails more classes, it seems that they are more likely to abuse alcohol. To simplify the relationship, let’s look at students having failed at least one class versus their risk group.
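Collapsing the number of failures into a yes/no feature is a simple recode. The Python sketch below (the analysis itself was run in R; the names are my own) applies it to the counts from the failure table above.

```python
# Number of students by count of failed classes, from the table above.
failure_counts = {0: 549, 1: 70, 2: 16, 3: 14}


def has_failed(n_failures):
    """Recode a failure count into 'Yes' (at least one failure) or 'No'."""
    return "Yes" if n_failures >= 1 else "No"


failed_counts = {"No": 0, "Yes": 0}
for n_failures, n_students in failure_counts.items():
    failed_counts[has_failed(n_failures)] += n_students

print(failed_counts)
```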
## [1] "Number of students for each failed group:"
##
## No Yes
## 549 100
##
## No Yes
## 0.8459168 0.1540832
## df$Risk: Low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 1.000 1.132 1.000 2.000
## --------------------------------------------------------
## df$Risk: High
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 1.000 1.228 1.000 2.000
This should help to make the relationship look clearer. If a student has failed at least one class they are slightly more likely to abuse alcohol.
## [1] "Number of students for each educational support group:"
##
## No Yes
## 581 68
##
## No Yes
## 0.8952234 0.1047766
## df$Risk: Low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 1.000 1.114 1.000 2.000
## --------------------------------------------------------
## df$Risk: High
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 1.000 1.074 1.000 2.000
No strong relationship here.
## [1] "Number of students for each educational support group:"
##
## No Yes
## 251 398
##
## No Yes
## 0.3867488 0.6132512
## df$Risk: Low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 2.000 1.642 2.000 2.000
## --------------------------------------------------------
## df$Risk: High
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 2.000 1.517 2.000 2.000
Students who did not receive educational support from their family are more likely to abuse alcohol.
## [1] "Number of students for each paying group:"
##
## No Yes
## 610 39
##
## No Yes
## 0.93990755 0.06009245
Paying for extra classes doesn’t really change a student’s drinking habits.
## [1] "Number of students for each activity group:"
##
## No Yes
## 334 315
##
## No Yes
## 0.5146379 0.4853621
Students in the high risk group are more likely to participate in extra-curricular activities.
## [1] "Number of students for each nursery group:"
##
## No Yes
## 128 521
##
## No Yes
## 0.1972265 0.8027735
Attending nursery school as a young child looks to slightly decrease the likelihood of drinking excessively when older.
## [1] "Number of students for each education group:"
##
## No Yes
## 69 580
##
## No Yes
## 0.1063174 0.8936826
Students that are less inclined to attend higher education are more likely to drink excessive amounts of alcohol.
## [1] "Number of students that have internet at home:"
##
## No Yes
## 151 498
##
## No Yes
## 0.2326656 0.7673344
There doesn’t seem to be much of a relationship between alcohol consumption and internet access at home. However, I am surprised by the number of students that do not have access at home (the data is from 2008).
## [1] "Number of students for each relationship group:"
##
## No Yes
## 410 239
##
## No Yes
## 0.6317411 0.3682589
No strong relationship between having a significant other and alcohol consumption.
## [1] "Number of students for each quality group:"
##
## Very Bad Bad Average Good Excellent
## 22 29 101 317 180
##
## Very Bad Bad Average Good Excellent
## 0.03389831 0.04468413 0.15562404 0.48844376 0.27734977
## df$Risk: Low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 4.000 4.000 3.972 5.000 5.000
## --------------------------------------------------------
## df$Risk: High
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 3.000 4.000 3.792 5.000 5.000
Students in the high risk group typically have worse family relationships.
## [1] "Number of students for each time group:"
##
## Very Low Low Average High Very High
## 45 107 251 178 68
##
## Very Low Low Average High Very High
## 0.06933744 0.16486903 0.38674884 0.27426810 0.10477658
## df$Risk: Low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 3.000 3.000 3.132 4.000 5.000
## --------------------------------------------------------
## df$Risk: High
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 3.000 3.000 3.342 4.000 5.000
## [1] "Correlation between Amount of free time after school and risk:"
## [1] 0.08420331
There is a weak, but positive relationship between amount of free time after school and alcohol consumption.
## [1] "Number of students for each social group:"
##
## Very Low Low Average High Very High
## 48 145 205 141 110
##
## Very Low Low Average High Very High
## 0.07395994 0.22342065 0.31587057 0.21725732 0.16949153
## df$Risk: Low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 2.974 4.000 5.000
## --------------------------------------------------------
## df$Risk: High
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 3.000 4.000 3.893 5.000 5.000
## [1] "Correlation between frequency of going out with friends and risk"
## [1] 0.328838
This could be the most differentiating feature that we have seen yet. We can clearly see that students who go out with their friends more often are more likely to be in the high risk group.
## [1] "Number of students for each health group:"
##
## Very Bad Bad Mediocre Good Very Good
## 90 78 124 108 249
##
## Very Bad Bad Mediocre Good Very Good
## 0.1386749 0.1201849 0.1910632 0.1664099 0.3836672
## df$Risk: Low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 4.000 3.456 5.000 5.000
## --------------------------------------------------------
## df$Risk: High
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 3.000 4.000 3.805 5.000 5.000
## [1] "Correlation between health and risk:"
## [1] 0.1016733
We could be seeing the bias of a self-reported survey here. Despite having a very unhealthy habit, those in the high risk group rate their own health higher, on average, than those in the low risk group.
## df$Risk: Low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 2.000 3.264 4.000 32.000
## --------------------------------------------------------
## df$Risk: High
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 4.000 4.987 8.000 22.000
## [1] "Correlation between number of absences and risk:"
## [1] 0.1562277
Students in the high risk group seem to miss more classes than those in the low risk group.
## [1] "Period 1"
## df$Risk: Low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 10.00 12.00 11.68 14.00 19.00
## --------------------------------------------------------
## df$Risk: High
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.00 9.00 10.00 10.45 12.00 17.00
## [1] "Period 2"
## df$Risk: Low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 10.00 12.00 11.89 14.00 19.00
## --------------------------------------------------------
## df$Risk: High
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 9.00 10.00 10.51 12.00 18.00
## [1] "Period 3"
## df$Risk: Low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 10.00 12.00 12.26 14.00 19.00
## --------------------------------------------------------
## df$Risk: High
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 10.00 11.00 10.72 12.00 19.00
Students in the high risk group have worse grades on average in all three periods.
## [1] "Period 2 grades minus period 1 grades"
## df$Risk: Low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -9.000 -1.000 0.000 0.204 1.000 11.000
## --------------------------------------------------------
## df$Risk: High
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -8.0000 -1.0000 0.0000 0.0604 1.0000 5.0000
## [1] "Correlation with risk:"
## [1] -0.04085654
## [1] "Period 3 grades minus period 2 grades"
## df$Risk: Low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -8.000 0.000 0.000 0.372 1.000 3.000
## --------------------------------------------------------
## df$Risk: High
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -9.0000 0.0000 0.0000 0.2148 1.0000 6.0000
## [1] "Correlation with risk:"
## [1] -0.05177298
## [1] "Period 3 grades minus period 1 grades"
## df$Risk: Low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -9.000 0.000 1.000 0.576 2.000 11.000
## --------------------------------------------------------
## df$Risk: High
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -11.0000 0.0000 1.0000 0.2752 1.0000 6.0000
## [1] "Correlation with risk:"
## [1] -0.06954099
The two groups are much closer in these plots of grade changes than they were in the previous set of plots. It seems that students’ grades improve in a similar fashion, no matter what their drinking habits are.
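The grade-change features summarised above are plain differences between the period grades. As an illustration, here is a Python sketch (not the original R code; the names are my own) using the first ten `G1`, `G2`, and `G3` values from the `str()` output at the top.

```python
# First ten students' period grades, copied from the str() output.
g1 = [0, 9, 12, 14, 11, 12, 13, 10, 15, 12]
g2 = [11, 11, 13, 14, 13, 12, 12, 13, 16, 12]
g3 = [11, 11, 12, 14, 13, 13, 13, 13, 17, 13]


def grade_change(later, earlier):
    """Element-wise difference: later period grade minus earlier period grade."""
    return [b - a for a, b in zip(earlier, later)]


g2_minus_g1 = grade_change(g2, g1)
g3_minus_g2 = grade_change(g3, g2)
g3_minus_g1 = grade_change(g3, g1)

print(g2_minus_g1)
print(g3_minus_g2)
print(g3_minus_g1)
```

The first student goes from 0 to 11 between periods 1 and 2, which matches the maximum change of 11 reported for the low risk group.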
It is very clear here that students in the high risk group typically have below median grades.
Note: There are a number of combinations of features that I compared, but I will only present the plots that show a stronger relationship.
By looking at the top-left and bottom-right of the plot, we can see the pattern most clearly. Students who go out more often with their friends and study less are more likely to abuse alcohol than those who have the opposite habits.
The main grouping that I see here is in the top right, which represents males who go out more often with their friends (higher risk).
## [1] "Perform recursive feature engineering."
##
## Recursive feature selection
##
## Outer resampling method: Cross-Validated (10 fold)
##
## Resampling performance over subset size:
##
## Variables Accuracy Kappa AccuracySD KappaSD Selected
## 1 0.8385 0.5107 0.05064 0.1529 *
## 2 0.8346 0.4952 0.04982 0.1516
## 3 0.8327 0.4782 0.05132 0.1697
## 4 0.8346 0.4879 0.05302 0.1661
## 5 0.8346 0.4911 0.05302 0.1659
## 6 0.8365 0.5030 0.04818 0.1450
## 7 0.8346 0.4947 0.04640 0.1408
## 8 0.8308 0.4751 0.05270 0.1665
## 9 0.8308 0.4717 0.05719 0.1799
## 10 0.8231 0.4557 0.06000 0.1865
## 56 0.8250 0.4314 0.03563 0.1184
##
## The top 1 variables (out of 1):
## combos
## [1] "Features ranked by importance:"
## combos maleOut simple sex
## 29.301525 16.143811 11.092364 2.376313
Train the random forest model.
## + Fold01: mtry=1
## - Fold01: mtry=1
## + Fold02: mtry=1
## - Fold02: mtry=1
## + Fold03: mtry=1
## - Fold03: mtry=1
## + Fold04: mtry=1
## - Fold04: mtry=1
## + Fold05: mtry=1
## - Fold05: mtry=1
## + Fold06: mtry=1
## - Fold06: mtry=1
## + Fold07: mtry=1
## - Fold07: mtry=1
## + Fold08: mtry=1
## - Fold08: mtry=1
## + Fold09: mtry=1
## - Fold09: mtry=1
## + Fold10: mtry=1
## - Fold10: mtry=1
## Aggregating results
## Fitting final model on full training set
Train the K-Nearest Neighbours model.
## + Fold01: k=10
## - Fold01: k=10
## + Fold02: k=10
## - Fold02: k=10
## + Fold03: k=10
## - Fold03: k=10
## + Fold04: k=10
## - Fold04: k=10
## + Fold05: k=10
## - Fold05: k=10
## + Fold06: k=10
## - Fold06: k=10
## + Fold07: k=10
## - Fold07: k=10
## + Fold08: k=10
## - Fold08: k=10
## + Fold09: k=10
## - Fold09: k=10
## + Fold10: k=10
## - Fold10: k=10
## Aggregating results
## Fitting final model on full training set
Train the Support Vector Machines model.
## + Fold01: sigma=1, C=1, Weight=1
## - Fold01: sigma=1, C=1, Weight=1
## + Fold02: sigma=1, C=1, Weight=1
## - Fold02: sigma=1, C=1, Weight=1
## + Fold03: sigma=1, C=1, Weight=1
## - Fold03: sigma=1, C=1, Weight=1
## + Fold04: sigma=1, C=1, Weight=1
## - Fold04: sigma=1, C=1, Weight=1
## + Fold05: sigma=1, C=1, Weight=1
## - Fold05: sigma=1, C=1, Weight=1
## + Fold06: sigma=1, C=1, Weight=1
## - Fold06: sigma=1, C=1, Weight=1
## + Fold07: sigma=1, C=1, Weight=1
## - Fold07: sigma=1, C=1, Weight=1
## + Fold08: sigma=1, C=1, Weight=1
## - Fold08: sigma=1, C=1, Weight=1
## + Fold09: sigma=1, C=1, Weight=1
## - Fold09: sigma=1, C=1, Weight=1
## + Fold10: sigma=1, C=1, Weight=1
## - Fold10: sigma=1, C=1, Weight=1
## Aggregating results
## Fitting final model on full training set
Train the extreme gradient boosting model.
## + Fold01: nrounds=150, max_depth=3, eta=0.1, gamma=0.1, colsample_bytree=1, min_child_weight=0.8, subsample=1
## - Fold01: nrounds=150, max_depth=3, eta=0.1, gamma=0.1, colsample_bytree=1, min_child_weight=0.8, subsample=1
## + Fold02: nrounds=150, max_depth=3, eta=0.1, gamma=0.1, colsample_bytree=1, min_child_weight=0.8, subsample=1
## - Fold02: nrounds=150, max_depth=3, eta=0.1, gamma=0.1, colsample_bytree=1, min_child_weight=0.8, subsample=1
## + Fold03: nrounds=150, max_depth=3, eta=0.1, gamma=0.1, colsample_bytree=1, min_child_weight=0.8, subsample=1
## - Fold03: nrounds=150, max_depth=3, eta=0.1, gamma=0.1, colsample_bytree=1, min_child_weight=0.8, subsample=1
## + Fold04: nrounds=150, max_depth=3, eta=0.1, gamma=0.1, colsample_bytree=1, min_child_weight=0.8, subsample=1
## - Fold04: nrounds=150, max_depth=3, eta=0.1, gamma=0.1, colsample_bytree=1, min_child_weight=0.8, subsample=1
## + Fold05: nrounds=150, max_depth=3, eta=0.1, gamma=0.1, colsample_bytree=1, min_child_weight=0.8, subsample=1
## - Fold05: nrounds=150, max_depth=3, eta=0.1, gamma=0.1, colsample_bytree=1, min_child_weight=0.8, subsample=1
## + Fold06: nrounds=150, max_depth=3, eta=0.1, gamma=0.1, colsample_bytree=1, min_child_weight=0.8, subsample=1
## - Fold06: nrounds=150, max_depth=3, eta=0.1, gamma=0.1, colsample_bytree=1, min_child_weight=0.8, subsample=1
## + Fold07: nrounds=150, max_depth=3, eta=0.1, gamma=0.1, colsample_bytree=1, min_child_weight=0.8, subsample=1
## - Fold07: nrounds=150, max_depth=3, eta=0.1, gamma=0.1, colsample_bytree=1, min_child_weight=0.8, subsample=1
## + Fold08: nrounds=150, max_depth=3, eta=0.1, gamma=0.1, colsample_bytree=1, min_child_weight=0.8, subsample=1
## - Fold08: nrounds=150, max_depth=3, eta=0.1, gamma=0.1, colsample_bytree=1, min_child_weight=0.8, subsample=1
## + Fold09: nrounds=150, max_depth=3, eta=0.1, gamma=0.1, colsample_bytree=1, min_child_weight=0.8, subsample=1
## - Fold09: nrounds=150, max_depth=3, eta=0.1, gamma=0.1, colsample_bytree=1, min_child_weight=0.8, subsample=1
## + Fold10: nrounds=150, max_depth=3, eta=0.1, gamma=0.1, colsample_bytree=1, min_child_weight=0.8, subsample=1
## - Fold10: nrounds=150, max_depth=3, eta=0.1, gamma=0.1, colsample_bytree=1, min_child_weight=0.8, subsample=1
## Aggregating results
## Fitting final model on full training set
## [1] "Random Forest Model:"
## Cross-Validated (10 fold) Confusion Matrix
##
## (entries are percentual average cell counts across resamples)
##
## Reference
## Prediction Low High
## Low 70.2 10.6
## High 6.7 12.5
##
## Accuracy (average) : 0.8269
The accuracy is much higher compared to the previous analysis, but nearly half of the high risk students are being labelled as low risk, which is not something that we wanted to happen.
## [1] "K-Nearest Neighbours Model:"
## Cross-Validated (10 fold) Confusion Matrix
##
## (entries are percentual average cell counts across resamples)
##
## Reference
## Prediction Low High
## Low 63.8 8.7
## High 13.1 14.4
##
## Accuracy (average) : 0.7827
The accuracy is a little lower compared to the Random Forest model, but the high risk students were predicted more accurately. I would call this a good trade off.
## [1] "Support Vector Machines Model:"
## Cross-Validated (10 fold) Confusion Matrix
##
## (entries are percentual average cell counts across resamples)
##
## Reference
## Prediction Low High
## Low 71.9 11.7
## High 5.0 11.3
##
## Accuracy (average) : 0.8327
High risk students were predicted with slightly worse than 50% accuracy, but low risk students were predicted very accurately.
## [1] "Extreme Gradient Boosting Model:"
## Cross-Validated (10 fold) Confusion Matrix
##
## (entries are percentual average cell counts across resamples)
##
## Reference
## Prediction Low High
## Low 66.3 9.2
## High 10.6 13.8
##
## Accuracy (average) : 0.8019
This is rather similar to the K-Nearest Neighbours model. High Risk students were predicted with some accuracy, and the overall accuracy is close to 80%.
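To compare the four models on the point that matters most here, how many high risk students are caught, the per-class recall can be read off each cross-validated matrix. The Python sketch below (not the R code used for the analysis; names are my own) divides each diagonal cell by its column total, since the columns hold the reference classes.

```python
# Cross-validated confusion matrices (percent of cells), copied from above.
# Layout: rows = prediction (Low, High), columns = reference (Low, High).
cv_matrices = {
    "Random Forest": [[70.2, 10.6], [6.7, 12.5]],
    "K-Nearest Neighbours": [[63.8, 8.7], [13.1, 14.4]],
    "Support Vector Machines": [[71.9, 11.7], [5.0, 11.3]],
    "Extreme Gradient Boosting": [[66.3, 9.2], [10.6, 13.8]],
}


def recall_by_class(matrix):
    """Share of each reference class that was predicted correctly."""
    (pred_low_ref_low, pred_low_ref_high), (pred_high_ref_low, pred_high_ref_high) = matrix
    return {
        "Low": pred_low_ref_low / (pred_low_ref_low + pred_high_ref_low),
        "High": pred_high_ref_high / (pred_low_ref_high + pred_high_ref_high),
    }


recalls = {name: recall_by_class(m) for name, m in cv_matrices.items()}
for name, r in recalls.items():
    print(f"{name}: Low {r['Low']:.1%}, High {r['High']:.1%}")
```

On these numbers, the K-Nearest Neighbours and Extreme Gradient Boosting models identify roughly 60% of the high risk students, while the Random Forest and Support Vector Machines models identify about half.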
## [1] "Random Forest Model:"
## Confusion Matrix and Statistics
##
## Reference
## Prediction Low High
## Low 80 17
## High 20 12
##
## Accuracy : 0.7132
## 95% CI : (0.627, 0.7893)
## No Information Rate : 0.7752
## P-Value [Acc > NIR] : 0.9605
##
## Kappa : 0.2062
## Mcnemar's Test P-Value : 0.7423
##
## Sensitivity : 0.8000
## Specificity : 0.4138
## Pos Pred Value : 0.8247
## Neg Pred Value : 0.3750
## Prevalence : 0.7752
## Detection Rate : 0.6202
## Detection Prevalence : 0.7519
## Balanced Accuracy : 0.6069
##
## 'Positive' Class : Low
##
## [1] "K-Nearest Neighbour Model:"
## Confusion Matrix and Statistics
##
## Reference
## Prediction Low High
## Low 72 13
## High 28 16
##
## Accuracy : 0.6822
## 95% CI : (0.5944, 0.7613)
## No Information Rate : 0.7752
## P-Value [Acc > NIR] : 0.99450
##
## Kappa : 0.2296
## Mcnemar's Test P-Value : 0.02878
##
## Sensitivity : 0.7200
## Specificity : 0.5517
## Pos Pred Value : 0.8471
## Neg Pred Value : 0.3636
## Prevalence : 0.7752
## Detection Rate : 0.5581
## Detection Prevalence : 0.6589
## Balanced Accuracy : 0.6359
##
## 'Positive' Class : Low
##
## [1] "Support Vector Machines Model:"
## Confusion Matrix and Statistics
##
## Reference
## Prediction Low High
## Low 87 18
## High 13 11
##
## Accuracy : 0.7597
## 95% CI : (0.6766, 0.8305)
## No Information Rate : 0.7752
## P-Value [Acc > NIR] : 0.7058
##
## Kappa : 0.2656
## Mcnemar's Test P-Value : 0.4725
##
## Sensitivity : 0.8700
## Specificity : 0.3793
## Pos Pred Value : 0.8286
## Neg Pred Value : 0.4583
## Prevalence : 0.7752
## Detection Rate : 0.6744
## Detection Prevalence : 0.8140
## Balanced Accuracy : 0.6247
##
## 'Positive' Class : Low
##
## [1] "Extreme Gradient Boosting Model:"
## Confusion Matrix and Statistics
##
## Reference
## Prediction Low High
## Low 66 16
## High 34 13
##
## Accuracy : 0.6124
## 95% CI : (0.5227, 0.6969)
## No Information Rate : 0.7752
## P-Value [Acc > NIR] : 0.99999
##
## Kappa : 0.0887
## Mcnemar's Test P-Value : 0.01621
##
## Sensitivity : 0.6600
## Specificity : 0.4483
## Pos Pred Value : 0.8049
## Neg Pred Value : 0.2766
## Prevalence : 0.7752
## Detection Rate : 0.5116
## Detection Prevalence : 0.6357
## Balanced Accuracy : 0.5541
##
## 'Positive' Class : Low
##
## [1] "An Ensemble of all the Models:"
## Confusion Matrix and Statistics
##
## Reference
## Prediction Low High
## Low 70 12
## High 30 17
##
## Accuracy : 0.6744
## 95% CI : (0.5864, 0.7543)
## No Information Rate : 0.7752
## P-Value [Acc > NIR] : 0.996898
##
## Kappa : 0.2345
## Mcnemar's Test P-Value : 0.008712
##
## Sensitivity : 0.7000
## Specificity : 0.5862
## Pos Pred Value : 0.8537
## Neg Pred Value : 0.3617
## Prevalence : 0.7752
## Detection Rate : 0.5426
## Detection Prevalence : 0.6357
## Balanced Accuracy : 0.6431
##
## 'Positive' Class : Low
##
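The summary statistics printed under each test-set confusion matrix can be reproduced from the four cell counts. The Python sketch below (the analysis itself was run in R; the function name is my own) does this for the ensemble matrix, treating ‘Low’ as the positive class as the output states.

```python
def summarize(matrix):
    """Compute summary statistics from a 2x2 confusion matrix.

    Layout: rows = prediction (Low, High), columns = reference (Low, High),
    with 'Low' treated as the positive class.
    """
    (tp, fp), (fn, tn) = matrix
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    sensitivity = tp / (tp + fn)  # reference Low predicted as Low
    specificity = tn / (tn + fp)  # reference High predicted as High
    # Agreement expected by chance, from the row and column totals.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
    kappa = (accuracy - expected) / (1 - expected)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "kappa": kappa,
    }


# Ensemble test-set matrix, copied from the output above.
ensemble = summarize([[70, 12], [30, 17]])
for name, value in ensemble.items():
    print(f"{name}: {value:.4f}")
```

Because ‘Low’ is the positive class, the specificity is the figure to watch: it is the share of high risk students that the model actually flags.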
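The initial hope that these models would be more useful than the ones of the previous analysis did not come to fruition. None of these models had a P-value (Accuracy > No Information Rate) below 0.05, so none of them was significantly more accurate than always predicting the majority class. In fact, every model’s test accuracy fell below the no information rate of 0.7752.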
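For intuition about the `P-Value [Acc > NIR]` line, here is a Python sketch of a one-sided exact binomial test: the probability of getting at least the observed number of correct predictions if the true accuracy were only the no information rate. I am assuming this is how the reported values were computed; the function name is my own, and the counts are taken from the test-set matrices above (129 students, 100 of them low risk).

```python
from math import comb


def p_value_acc_gt_nir(correct, n, nir):
    """P(X >= correct) for X ~ Binomial(n, nir): a one-sided exact binomial test."""
    return sum(
        comb(n, k) * nir**k * (1 - nir) ** (n - k)
        for k in range(correct, n + 1)
    )


n_test = 129
nir = 100 / n_test  # share of the majority class (Low) in the test set

# Correct predictions (diagonal sums) from each test-set confusion matrix.
correct_counts = {
    "Random Forest": 80 + 12,
    "K-Nearest Neighbours": 72 + 16,
    "Support Vector Machines": 87 + 11,
    "Extreme Gradient Boosting": 66 + 13,
    "Ensemble": 70 + 17,
}

p_values = {
    name: p_value_acc_gt_nir(correct, n_test, nir)
    for name, correct in correct_counts.items()
}
for name, p in p_values.items():
    print(f"{name}: {p:.4f}")
```

If that assumption holds, the results should land close to the values reported above. Every one of them is far above 0.05, which is what you would expect when each model's accuracy is lower than the no information rate.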
I am not entirely sure what is missing for these models to become useful and statistically significant. Of course, it is easy to say that more data would have helped, but what else? New features could have been useful, such as “has older sibling” (someone to buy them alcohol), “frequency of parents’ consumption of alcohol” (whether drinking is something the students see regularly and are influenced by), or “has their own car” (it is easier to get up to mischief with the freedom of mobility). To reiterate what was said at the start of this analysis, there might be some bias in the data. Since the data is about students taking a Portuguese language course, rather than a mandatory class such as English or Math, we are limited to building models from students that share one particular interest, rather than observing the full range of students.